Exploratory Data Analysis of White Wine by Adam MacDonald

Univariate Plots Section

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

## [1] "3" "4" "5" "6" "7" "8" "9"

I want to start with the output variable, quality, and get a visual of its distribution along with some summary statistics for the data set. The histogram shows a roughly normal distribution over a small range (3 to 9) consisting only of integers. The most common quality score for white wine in this data set is 6, followed by 5. Because quality is scored in integers on a 0 to 10 scale, I feel this variable should be represented as a factor: the 0 to 10 scale is arbitrary and could just as easily have been a scale of ‘A to K’.
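As a minimal sketch of the factor conversion described above, the snippet below uses a short, hypothetical vector of scores (not the actual 4,898 observations) to show how `factor()` turns integer scores into labelled levels and how `table()` counts them.

```r
# Hypothetical quality scores for illustration only; not the wine data.
quality <- c(6L, 5L, 6L, 7L, 3L, 9L, 4L, 8L, 6L, 5L)

# Treat the integer scores as categories rather than numbers.
quality_f <- factor(quality)

levels(quality_f)  # distinct scores, stored as character labels
table(quality_f)   # number of observations per score
```

Because the scores are ranked, `factor(quality, ordered = TRUE)` would be a reasonable alternative if the ordering needs to be preserved.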

A histogram of fixed.acidity shows a close to normal distribution, but I noticed a couple of observations in the right tail at values of approximately 12 and 14. I created a boxplot of fixed.acidity to see how far from the IQR these and other observations fall, to get an idea of potential outliers.
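To make the outlier check concrete, here is a small worked example of the conventional 1.5 * IQR rule, using the fixed.acidity quartiles printed in the summary output above (6.3 and 7.3). The variable names are my own, and a boxplot function may compute its hinges slightly differently, so treat the fences as approximate.

```r
# Quartiles of fixed.acidity as reported by summary() above.
q1 <- 6.3
q3 <- 7.3
iqr <- q3 - q1

# Conventional fences: points beyond these are flagged as potential outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# The observation near 12 and the reported maximum of 14.2 both exceed the upper fence.
c(12, 14.2) > upper_fence
```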

The volatile.acidity distribution has a long right tail, indicating positive skewness. This also shows in the boxplot of volatile.acidity, where all of the observations beyond 1.5 * IQR lie on the right side of the distribution. I transformed the data using a log10 scale layer to try to normalize the distribution. The result is much closer to a normal distribution than the untransformed variable, but it still has multiple outliers in the right tail. I then tried a sqrt transformation to see whether it would better normalize the distribution, but it turned out worse than the log10 transformation. I’m curious which type of acidity, fixed or volatile, has more impact on the quality score of the wine and will investigate this later in the analysis.
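One way to compare transformations numerically, rather than by eye, is to compute the sample skewness before and after each one. The sketch below does this on simulated right-skewed data; it is NOT the wine data, and the `skewness()` helper and the simulation parameters are assumptions made purely for illustration.

```r
set.seed(42)

# Simulated right-skewed values (log-normal); a stand-in, not volatile.acidity itself.
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Moment-based sample skewness; values near 0 indicate a symmetric distribution.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skews <- c(
  raw   = skewness(x),
  sqrt  = skewness(sqrt(x)),
  log10 = skewness(log10(x))
)
round(skews, 3)
```

For data simulated this way the log10 version is symmetric by construction, so the example only demonstrates the comparison itself; on the real variable the log10 result still left outliers in the right tail, as noted above.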

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

I created a histogram grid for the remaining variables, since they are not as instinctively linked as fixed.acidity and volatile.acidity, whose distributions I wanted to observe closely in order to distinguish between the two. The function I used builds a grid that shows, in addition to the standard histogram, a fitted normal curve and a density curve, which is an alternative view of each variable’s distribution. From the resulting charts, a number of variables likely have outliers, judging by the length of the x-axis relative to the bars of the histogram. For example, the last identifiable bar for free.sulfur.dioxide sits just before 100, but the axis extends to 300, suggesting an observation far from the rest of the data. On investigation this is the case: there is an entry at 289. The purpose of this grid was to get a quick idea of what the remaining variables look like before continuing into further analysis.

Changing the scale to omit the top 1% of free.sulfur.dioxide observations, which were causing the long tail in the distribution (upper left), reveals the underlying distribution (bottom left). The boxplot in the upper right shows the single extreme observation at 289, as well as other outliers. With the top 1% removed, the boxplot improves much like the histogram. I applied the same approach to the other variables in the grid above whose data was compressed into the left portion of the chart, to identify their underlying distributions.
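The top 1% cutoff can be sketched with `quantile()`. The vector below is only an illustration: it combines the first ten free.sulfur.dioxide values shown in the `str()` output with the reported maximum of 289, so the cutoff it produces is not the one for the full data set.

```r
# First ten values from the str() output plus the reported maximum (illustrative subset).
fsd <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 289)

# 99th percentile; observations above it are treated as the "top 1%".
cutoff <- quantile(fsd, probs = 0.99)
trimmed <- fsd[fsd <= cutoff]

range(fsd)      # includes the extreme value
range(trimmed)  # extreme value removed
```

In ggplot2, setting limits on a scale drops out-of-range observations before statistics are computed, whereas `coord_cartesian()` only zooms the view. That distinction matters for boxplots, since dropping observations changes the quartiles while zooming does not.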

Not all of the histograms were affected by extreme cases in the same way as free.sulfur.dioxide. Residual.sugar still doesn’t appear normally distributed when the top 1% of observations are omitted, so I applied a log transformation. On the log scale the distribution appears to be bimodal, with peaks just over 1 and around 8.

Univariate Analysis

What is the structure of your dataset?

The data set consists of 4,898 observations and 13 variables. The first variable is an index for each observation of white wine. The last variable is the output variable and is an integer quality score from 0 to 10 (0 being the worst and 10 being the best). The remaining variables are all explanatory variables and stored as numeric data types.

I changed the quality output variable from an integer data type to a factor. A quality score of 6 is the most common, with 3 and 9 as the lowest and highest quality scores observed.

What is/are the main feature(s) of interest in your dataset?

The main feature of the data set is the wine quality. I will look to determine which variable(s) affect the quality of the wine the most.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the description of the data set provided, these variables potentially affect the taste of the wine, so I suspect that volatile.acidity (vinegar taste), citric.acid (freshness), and total.sulfur.dioxide (scent) will have the largest impact on quality.

Did you create any new variables from existing variables in the dataset?

No. I did consider adding bound.sulfur.dioxide by taking the difference between total and free sulfur dioxide, but decided it likely wouldn’t contribute much to the analysis, as it is fully determined by the two existing sulfur dioxide variables.
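For reference, the derived variable that was considered amounts to a single subtraction. The sketch below uses the first ten rows from the `str()` output above; the name `bound_so2` is hypothetical and mirrors the bound.sulfur.dioxide idea without being part of the actual analysis.

```r
# First ten rows of each column, taken from the str() output above.
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)

# Bound sulfur dioxide is whatever part of the total is not free.
bound_so2 <- total_so2 - free_so2
bound_so2
```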

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The volatile.acidity variable, which I suspect is a strong predictor of quality, had a positively skewed distribution. After trying a couple of transformations, the log10 transformation appeared to normalize the distribution best, so I can use it to identify linear relationships with other variables.

The histogram grid showed many variables with extremely long tails in relation to the majority of their data, which led me to observe the distributions when removing the top 1% of observations for those respective variables.

Of the iterative charts included in the analysis above, residual.sugar showed an interesting distribution: a very long range with most observations at the low end (even with the top 1% of observations removed). I applied a log transformation and observed what appears to be a bimodal distribution that was otherwise not noticeable.

Bivariate Plots Section

From the boxplots of alcohol by quality, alcohol % appears to be a good predictor of the quality score, so I will look into this relationship first. The matrix also shows a number of variables that are correlated with one another, which I will examine as well.

I analyzed the relationship first with a scatterplot using varying degrees of transparency and settled on an alpha of 1/20. With more transparency, the points at quality 3 and 9 were almost invisible, and with less, the trends for quality scores 5-7 were not distinguishable. Even with this alpha, I felt that adding jitter to disperse the data from the lines at each integer score in the previous chart would better display the underlying trend. The jittered chart effectively displays an increase in wine quality as alcohol % increases. Quality scores of 3 and 9 don’t have many observations to draw from. The boxplot is also an effective way of displaying this relationship.
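The effect of combining transparency with jitter can be sketched in base R graphics, which keeps the example free of package dependencies. Everything here is an assumption for illustration: the data is simulated (NOT the wine data), and the jitter amount of 0.3 is an arbitrary choice. In ggplot2, the analogous layer would be `geom_jitter(alpha = 1/20)`.

```r
set.seed(1)

# Simulated stand-in data: integer scores, with alcohol loosely rising with the score.
quality <- sample(3:9, 2000, replace = TRUE, prob = c(1, 3, 30, 45, 18, 3, 0.2))
alcohol <- 8 + 0.4 * quality + rnorm(2000, sd = 1)

# Spread each integer score horizontally by up to 0.3 so points stop stacking on one line.
x_jit <- jitter(quality, amount = 0.3)

# alpha.f = 1/20 means roughly 20 overlapping points are needed to reach full opacity.
plot(
  x_jit, alcohol,
  pch = 16,
  col = adjustcolor("black", alpha.f = 1 / 20),
  xlab = "quality", ylab = "alcohol (%)"
)
```

Keeping the jitter amount below 0.5 ensures that points from neighbouring scores do not overlap horizontally.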

I also noticed that density seemed to be well correlated with quality once the top 1% of density values were removed. As density decreases, quality appears to increase, indicating a negative correlation between density and quality.

Of the three variables I expected to have the largest impact on quality, based on the taste associations in the data set description, none shows a distinguishable trend in relation to quality.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -87.2549, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Knowing that alcohol and density are predictors of quality, it is useful to know whether these two variables are correlated with one another or with other variables, and the type and strength of those correlations. Of the variables in the ggplot matrix, density is the one most strongly correlated with alcohol, with a correlation of -0.78. After revisiting the data set information sheet, this should come as no surprise, since the description of the density attribute states “the density of water is close to that of water depending on the percent alcohol and sugar content”. Based on this, I will verify that residual.sugar and density are correlated as well.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$residual.sugar
## t = 107.8749, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Indeed, when subsetting the variables to exclude their top 1% of observations, there is a strong correlation between these two variables.
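As a sanity check on the two Pearson test outputs above, the t statistic can be recovered from the correlation and the degrees of freedom using the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The helper name `t_from_r` is my own; the inputs are copied from the printed output.

```r
# Standard relation between a Pearson correlation and its t statistic.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Correlations and degrees of freedom as printed in the cor.test() output above.
t_alcohol <- t_from_r(-0.7801376, 4896)  # density vs. alcohol
t_sugar   <- t_from_r(0.8389665, 4896)   # density vs. residual.sugar

round(c(density_alcohol = t_alcohol, density_sugar = t_sugar), 3)
```

Both values should agree, up to rounding, with the t statistics reported in the output (-87.2549 and 107.8749). With nearly 4,900 observations even a weak correlation would be statistically significant, so the size of the correlation is more informative here than the p-value.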

There appears to be a weak positive correlation between density and total.sulfur.dioxide.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main relationship observed in this investigation was between alcohol percentage and quality. Since the ultimate goal of this analysis is to determine which features affect quality the most, this was the one that stood out. Density also showed a correlation with quality; however, it is also correlated with alcohol and will need to be analyzed together with alcohol and quality in the multivariate analysis. Alcohol varied not only with density but also with residual.sugar, since residual sugar and alcohol percentage are both strongly correlated with density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Free.sulfur.dioxide and total.sulfur.dioxide, as expected, showed a strong correlation with one another as free SO2 is a component of total SO2 in the wine.

What was the strongest relationship you found?

The strongest relationship found was between density and residual.sugar, with a correlation of 0.839, which was explored in the analysis above.

Multivariate Plots Section

I first want to observe the relationship between the three correlated variables from the previous analysis: residual.sugar, alcohol, and density. This chart shows the relationship, but I don’t particularly like density as the color scale variable, as it has a small range. I will try swapping the variables assigned to each parameter of this chart.

I switched alcohol and density from the previous chart, and I feel this paints a better picture of the relationships. The earlier analysis showed it was useful to view residual.sugar on a log scale, so I will add this in the next chart.

After adding the log scale for residual.sugar, the cluster of data around 1 is dispersed, revealing a larger degree of variance when residual.sugar is low. As residual.sugar and density increase together, the alcohol percentage seems to decrease, as does the variance between density and residual.sugar. Knowing that alcohol is positively correlated with quality, I would expect that swapping alcohol for quality in this chart would show a similar trend, with quality decreasing as residual.sugar and density increase.

This chart shows the same relationship as with alcohol %: the majority of lower quality wines (green and brown dots) sit in the upper portion of the chart, and the higher quality wines fall, for the most part, below them, where density and/or residual.sugar are low.

Using a smoothing factor, and observing quality in relation to the median density by residual.sugar, it is evident that as density increases, quality decreases, and that in all cases density increases as residual.sugar increases.

I did not gain much insight from looking at the relationship between alcohol, total.sulfur.dioxide, and residual.sugar. I am still trying to identify other drivers of alcohol or quality.

This investigation shows a log relationship between alcohol and chlorides, but the variable total.sulfur.dioxide really doesn’t add any value to the chart.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate relationship I observed was between density, residual.sugar, and alcohol/quality. I first used density and residual.sugar to identify those two variables’ relationship with alcohol, then used the same two variables to verify that they had the same effect on quality, knowing that alcohol and quality are positively correlated with each other.

Were there any interesting or surprising interactions between features?

I did not identify any other variables that interacted with one another on the multivariate level to provide further insight into the featured variables. Some of the variables had little to no correlation with any of the other variables. For example, the strongest correlation of sulphates with any variable was 0.16 with pH, which is hardly a correlation.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

NA


Final Plots and Summary

Plot One

Description One

The distribution of white wine quality scores resembles a normal distribution, with the median score of 6, a range from 3 to 9, and the majority of wines falling between a score of 5 and 7.

Plot Two

Description Two

Wines with higher alcohol percentages tend to receive higher quality scores. The median alcohol percentage and its quartiles increase with each quality score from 5 to 9.

Plot Three

Description Three

Alcohol percentage in white wine decreases as density and residual sugar increase. Since the alcohol percentage of white wine is the best predictor of the quality score associated with the wine, the relationship between quality score, density, and residual sugar is also apparent where quality score decreases as density and residual sugar increase.


Reflection

I found a couple of aspects of this data set that made the analysis difficult. The first was that the output variable is essentially categorical, even though it is represented on a number scale. This limited the chart options for analysis involving this variable of interest. Another difficulty was the lack of categorical explanatory variables, which would have been useful for subsetting the data to identify further relationships. The final struggle was the minimal correlation between the variables in the data set; some variables weren’t correlated with any other variable at all. I found success when I looped back to the data set information sheet and applied that knowledge of the variables to the scatterplot matrix, to identify which variables should be correlated with one another and verify that they were. This allowed me to collect the most influential variables to understand what drives the quality score. This analysis could be enriched by collecting additional variables and adding them to the data set. Including price, the geographic origin of the wine, and the type of wine or grape used would also add categorical explanatory variables to the data.